When we talk about Web3 data, what are we talking about?

#Web3.0 #Infrastructure #Data

To understand what we mean by Web3 data, we first need to look at what data looks like in Web2. We will follow the full life cycle of data: generation, collection, storage, management and use. Before that, let's look at how data is defined.

The "Guidelines for Cybersecurity Standards Practice - Guidelines for Data Classification and Grading" (Draft for Comments, v1.0-202109), issued by China's National Information Security Standardization Technical Committee, classifies data into public data, personal information and legal person data. The definitions and examples are as follows.

Data Classification	Category Definition	Example
Public data	Data collected and generated by public management and service agencies in the course of performing public management and service duties in accordance with the law, and data involving public interests that other organizations and individuals collect and generate while providing public services	Such as government affairs data, and data involving public interests from public services such as water supply, power supply, gas supply, heating supply, public transportation, elderly care, education, medical care and postal services
Personal information	Various information recorded electronically or otherwise relating to an identified or identifiable natural person, excluding anonymized information	Such as personally identifiable information, personal biometric information, personal property information, personal communications information, personal location information, and personal health and physiological information
Legal person data	Data collected and generated by an organization in the process of production, operation and internal management	Such as business data, business management data, and system operation and security data

Within each category, data is further divided into five levels according to who would be harmed by a leak and how severely: public (level 1), internal (level 2), sensitive (level 3), important (level 4) and core (level 5). Public-level data behaves like a public good: it is non-rival and non-excludable. It is generally provided by governments or public organizations on behalf of the public that benefits from it, for example weather forecasts and macroeconomic data.
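The category-plus-level scheme can be written down as a small data structure. The following is a minimal sketch, not part of the guideline itself: the class names and the `is_open` rule are our own illustrative assumptions, and the level given to the biometrics example is chosen only for demonstration.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class DataCategory(Enum):
    PUBLIC = "public data"
    PERSONAL = "personal information"
    LEGAL_PERSON = "legal person data"


class SensitivityLevel(IntEnum):
    PUBLIC = 1      # public level
    INTERNAL = 2    # internal level
    SENSITIVE = 3   # sensitive level
    IMPORTANT = 4   # important level
    CORE = 5        # core level


@dataclass(frozen=True)
class DataLabel:
    """A label is the pair (category, level); both names are hypothetical."""
    category: DataCategory
    level: SensitivityLevel

    def is_open(self) -> bool:
        # Assumption of this sketch: only level-1 data is freely shareable.
        return self.level == SensitivityLevel.PUBLIC


weather = DataLabel(DataCategory.PUBLIC, SensitivityLevel.PUBLIC)
# The level below is an illustrative choice, not taken from the guideline.
biometrics = DataLabel(DataCategory.PERSONAL, SensitivityLevel.SENSITIVE)

print(weather.is_open(), biometrics.is_open())
```

Using an ordered enum for the level makes comparisons such as "at least sensitive" a plain `>=` check.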

1. Data generation, collection and storage

Public data, personal information and legal person data are mostly generated through our daily use of computer applications; personal information and legal person data are the most relevant to ordinary users. So how are they generated and collected? A highly abstract architecture diagram of an Internet product is shown below.

Web2 Application Architecture

The database at the bottom stores the data passed from the back end, which is generated by the interaction between the user and the front end. Broadly speaking, all of this is user data. For mobile applications, the data can be roughly divided into the following categories:

User information: user-related information recorded while the user uses the application service, including the user's identity information, device, network, geographic location and even the list of applications installed on the mobile device. It is generally collected through tracking points (event instrumentation).
Content data: data the user generates by using the application service, that is, any content other than personal information that the user actively writes into the application. It is part of the application service and is generally written directly to the server's data tables.
Behavior data: data generated by the user's interactions while using the application, reflecting behavioral habits such as viewing time, click rate, penetration rate and scrolling. It is generally collected through tracking points.
Log data: data generated by the application itself while it is being used, such as crash logs.
Code data: data not generated by user interaction, including front-end and back-end code. Like user data, it is stored on a centralized server somewhere.

In this classification, user information is personal information, while log data and code data are legal person data. Content data and behavior data are the debatable cases: in the Web2 era, centralized entities mostly treat them as their own business data, that is, as legal person data.

Are Web3 applications any different? Preethi Kasireddy's Web3 product architecture helps us understand.

Web3 Product Architecture Source: Preethi Kasireddy

Compared with Web2 applications, the user terminal and the front end are almost unchanged. The difference lies in the back end and the database. Through the front end, users talk to node providers (rather than a centralized server) and interact with contract code deployed on blockchains such as Ethereum (rather than a back-end environment on a server). This process generates the same types of data as above. Because the technical architecture differs, the data generated by Web3 applications is not necessarily stored on a centralized server, and data generated in different ways may be stored in different ways.

All data generated by interacting with smart contracts is published on the blockchain and can be accessed by anyone. It therefore becomes a public good; this includes asset information, transaction data and contract code. In theory, as long as block space is large enough, any data can be stored on a blockchain, and some projects are even trying to use the blockchain as a database.
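To make the "front end talks to a node provider" step concrete, here is a minimal sketch of the request body a front end could send. Ethereum nodes expose a JSON-RPC 2.0 interface, and `eth_getBalance` is one of its standard methods. The helper name `build_rpc_request` and the placeholder address are our own assumptions; the provider's endpoint URL is provider-specific, so no network call is made here.

```python
import json
from itertools import count

_request_ids = count(1)


def build_rpc_request(method: str, params: list) -> str:
    """Serialize a JSON-RPC 2.0 request body for an Ethereum node."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
        "params": params,
    })


# Placeholder address used only for illustration.
ADDRESS = "0x" + "00" * 20

# Ask for the balance of ADDRESS at the latest block.
balance_request = build_rpc_request("eth_getBalance", [ADDRESS, "latest"])
print(balance_request)
```

In a real application this string would be sent as the body of an HTTPS POST to the node provider, which takes the place of the Web2 back-end API.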

Data Classification	Type Definition	Storage	Examples
Asset information	Data related to blockchain-native assets	On-chain	Such as token addresses, attributes, links and even content information
Transaction data	Data related to transactions of blockchain-native assets	On-chain/off-chain	Such as the addresses of both parties to a token transaction, its time and amount, and the contracts interacted with
Contract code	Code of blockchain smart contract protocols	On-chain	Such as the ERC-20 contract standard
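Because transaction data is public, anyone can read the recipient and amount out of a token transfer. The sketch below decodes the calldata of an ERC-20 `transfer(address,uint256)` call using only the standard ABI layout (a 4-byte selector followed by two 32-byte words). The function name and the sample calldata are our own illustrative assumptions.

```python
TRANSFER_SELECTOR = "a9059cbb"  # first 4 bytes of keccak256("transfer(address,uint256)")


def decode_erc20_transfer(calldata_hex: str) -> tuple[str, int]:
    """Return (recipient, raw_amount) from ERC-20 transfer calldata."""
    data = calldata_hex.removeprefix("0x").lower()
    if len(data) != 8 + 64 + 64 or not data.startswith(TRANSFER_SELECTOR):
        raise ValueError("not a transfer(address,uint256) call")
    recipient_word = data[8:72]    # 32-byte word, address is right-aligned
    amount_word = data[72:136]     # 32-byte big-endian unsigned integer
    recipient = "0x" + recipient_word[24:]  # keep the last 20 bytes
    return recipient, int(amount_word, 16)


# Hypothetical calldata: send 1_000_000 base units to a made-up address.
recipient_hex = "11" * 20
calldata = "0x" + TRANSFER_SELECTOR + recipient_hex.rjust(64, "0") + format(1_000_000, "064x")

print(decode_erc20_transfer(calldata))
```

The amount is in the token's base units; converting it to a display value requires the token's `decimals`, which this sketch does not look up.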

At the current stage, apart from these three types, most of the data generated by a Web3 application is still stored on centralized servers, including front-end code, user information, content data, behavior data and log data. The storage infrastructure is still immature, so project teams are either constrained by technical limitations or choose a centralized solution to guarantee access speed. As the infrastructure develops, more and more capable storage networks have appeared, such as IPFS, Storj, Filecoin and Ceramic, and more applications have begun to deploy themselves on decentralized storage. Examples include deploying the front-end website on IPFS and accessing it through ENS to build a decentralized front end, and using Arweave to permanently store files such as the images behind an NFT project.

In general, when building a Web3 application, developers usually have 3 options for storing application data:

Store everything on the blockchain. This option is very expensive, forces the application to be as simple as possible, and makes all data completely public. The advantage is that application sovereignty is protected most directly.
Store the smart contract logic on the blockchain and everything else on a traditional back end. This approach sacrifices user sovereignty and carries centralization risk. It is how most Web3 apps work today.
Store the smart contract logic on the blockchain and the rest on decentralized storage such as IPFS, Arweave and Ceramic, managing and updating data through smart contracts. This method is expensive (Ceramic is currently free) and, for now, slow, but it preserves the sovereign nature of the application.

At present, the vast majority of Web3 applications use the second method, some specific applications can already be built with the third, and very few use the first. So which storage method should we choose, and which one is the trend?

Trend: Decentralized Storage - Data and Application Sovereignty

The three ways of building Web3 applications share one key word: sovereignty. The term is unavoidable when we talk about the characteristics of Web3, and it generally covers data sovereignty and application sovereignty. Is sovereignty important? That is a separate topic that this article does not discuss; interested readers can refer to related articles such as "Web3 Data Market Outlook" and "Web3 - Let the "right to data" awaken". Here I want to approach the path to Web3 sovereignty from the perspective of data, and from it deduce the direction and focus of infrastructure development.

Data sovereignty includes digital asset sovereignty and user data sovereignty. The article "Vertical Liquidity: How Value Interconnects" discusses how tokens can define a user's digital asset sovereignty (identity, relationships and property rights), which is determined by a broad consensus that is hard to tamper with. At the most basic level, these rights can be defined by the blockchain itself, for example which address a token belongs to. But once ownership of more complex digital goods is involved, many problems appear. A typical one is the storage of the images (or articles, etc.) behind NFTs, which is discussed in "[NFT: Revolution of Digital Ownership](https://mirror.xyz/bubai.eth/KAgr16vN8IRtjHARFQ6Wx35OmgLj9Oh2qcRkiiNw_OE)". For most NFTs today, the digital object is stored on a centralized server somewhere. [Once the server crashes or gets hacked](https://moxie.org/2022/01/07/web3-first-impressions.html), all the user has is a string of on-chain hashes, and the real "item" behind the hash can be stolen or replaced at any time, making it worthless.

User data sovereignty is one of the clearest dividing lines between Web2 and Web3 and a banner for Web3's innovation and progress. Here, Ceramic envisions a data universe: a composable, web-scale data ecosystem owned by everyone but exclusive to no one. User data follows the user from one app to another, and the user acts as the hub controlling their own digital universe. At present almost no application achieves this. CyberConnect has made a good attempt: it is building a decentralized social graph protocol that makes users' social relationship data interoperable across applications. For now, however, the application does not guarantee users' data sovereignty; it has begun migrating to Ceramic, but everything is still in progress.

Regarding application sovereignty, some people refer to sovereign applications as "hyperstructures": applications that are unstoppable, free, valuable, scalable and permissionless, with positive externalities and credible neutrality. Combined, these features provide a public good in the digital world and create the infrastructure of the "metaverse" (if you believe in it). At present, most so-called Web3 applications do not have a high degree of application sovereignty. They are not real public goods and can easily be sanctioned or altered by those in power; the [Tornado Cash incident](https://home.treasury.gov/news/press-releases/jy0916) illustrates the problem very directly. One of the main reasons is that although the contract code of these protocols is published on the blockchain, components such as front ends and domain names are still controlled by centralized third parties.

To realize data sovereignty and application sovereignty, how a Web3 application is built matters a great deal, and the starting point is storage. Where does the data live, and how can it be stored so that users have sovereignty over it? In general, different types of user data call for different solutions:

Users' asset information and transaction data should be public ledger data. Ensuring on-chain verifiability is the most important thing, but applications such as Aztec, which protect the privacy of users' on-chain transactions, are also very valuable.
Users' user information, content data and behavior data are regarded as personal information, and ensuring the user's right of control is essential. With the user's consent, this data can be selectively disclosed as a public good to unlock positive externalities.
Log data and code data are legal person data, for which privatization is acceptable and necessary. However, Web3 infrastructure applications of the "hyperstructure" kind should have the characteristics of public infrastructure: the storage of application code should be open and able to resist censorship beyond the platform level.

At present, most Web3 applications "store the smart contract logic on the blockchain and everything else on a traditional back end" because no decentralized infrastructure is yet good enough to replace centralized infrastructure.

Decentralized storage networks such as IPFS, Filecoin and Arweave are static storage. They lack computation and state management and cannot provide more advanced database-like functions (such as mutability, version control, access control and programmable logic). Ceramic is dynamic storage and solves these problems to some extent, but its access speed is still relatively slow, its development kit is immature, and its degree of decentralization has been criticized.

The main function of decentralized storage such as IPFS, Filecoin and Arweave is to statically store unstructured data such as images, documents and static code. Because stored data is hard to tamper with, it protects the sovereignty of digital goods such as NFTs to a certain extent: once the link between the on-chain hash and the off-chain storage address is established, it is hard for outside forces to alter it. Building the front-end code on such storage also makes application sovereignty more complete. However, current storage technology only stores; the lack of computation leaves its functionality far behind centralized server solutions.
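The link between an on-chain hash and off-chain content can be illustrated with a toy content-addressed store. This is a simplified sketch under our own assumptions: the `ContentStore` class is hypothetical and kept in memory, and it uses a plain hex SHA-256 digest as the address, whereas real networks such as IPFS wrap the digest in their own identifier format.

```python
import hashlib


class ContentStore:
    """In-memory stand-in for a static, content-addressed storage network."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        # The address is derived from the bytes themselves.
        address = hashlib.sha256(content).hexdigest()
        self._blobs[address] = content
        return address

    def get(self, address: str) -> bytes:
        content = self._blobs[address]
        # Anyone can re-hash what they received to detect tampering.
        if hashlib.sha256(content).hexdigest() != address:
            raise ValueError("content does not match its address")
        return content


store = ContentStore()
original = store.put(b"artwork v1")
edited = store.put(b"artwork v2")

# What a token contract would hold: only the hash, not the file.
onchain_record = {"token_id": 1, "content_hash": original}

assert store.get(onchain_record["content_hash"]) == b"artwork v1"
assert original != edited  # an "update" is really a new object at a new address
```

The same property explains both the strength and the limitation described above: content cannot be silently replaced, but it also cannot be updated in place.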

The mainstream decentralized storage projects on the market are shown in the following table. Token data is from CoinMarketCap as of August 23, 2022. The table summarizes and updates the one in "Web3 Decentralized Storage Evolution History".

Project name	Launch time	Features	Storage type	Storage fee	Security	Token	Market cap (USD millions)
Storj	2017	File Storage	Cloud Storage	$0.0039/GB/month	Encryption for Privacy, File Slicing for Distributed Storage	STORJ	215
IPFS	2017	File Storage	Incentive Layer/Protocol/Static Storage	-	-	-	-
Filecoin	2020	File Storage	Static Storage	$1e-7/GB/month	Proof of Replication prevents cheating, Proof of Spacetime prevents data deletion	FIL	1678
Arweave	2020	File Storage	Static Storage/Persistent Storage	$3.77/GB *Persistent Storage	Persistent Storage, No Privacy Solution yet	AR	404
Stratos	Testnet	File Storage/Computation/Database Storage	Static Storage/Database Storage	-	Three Levels and Three Consensus to Solve the Impossible Triangle of Security, Efficiency and Decentralization	STRO	5.2
Ceramic	2021	Data Stream Storage	Data Stream Storage/Incentive Layer	-	Data Selective Encryption Ensures Privacy, User-Centric Ensures User Data Autonomous Control	-	-
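The fee column mixes two pricing models: monthly rental (Storj, Filecoin) and a one-time fee for permanent storage (Arweave). A short calculation makes them comparable. This is a rough sketch that takes the table's prices at face value and ignores price changes and token volatility; the function names are our own.

```python
MONTHLY_PRICE_PER_GB = {"Storj": 0.0039, "Filecoin": 1e-7}  # USD, from the table
ARWEAVE_ONE_TIME_PER_GB = 3.77                                # USD, paid once


def rental_cost(project: str, gigabytes: float, months: int) -> float:
    """Total rental cost in USD for the given size and duration."""
    return MONTHLY_PRICE_PER_GB[project] * gigabytes * months


def break_even_months(project: str) -> float:
    """Months of rental after which Arweave's one-time fee becomes cheaper."""
    return ARWEAVE_ONE_TIME_PER_GB / MONTHLY_PRICE_PER_GB[project]


print(round(rental_cost("Storj", 100, 12), 2))   # 100 GB for one year
print(round(break_even_months("Storj")))          # in months
```

At these prices, renting 100 GB on Storj for a year costs about $4.68, and rental only becomes more expensive than Arweave's one-time fee after roughly 967 months (about 80 years). Permanent storage is therefore a premium paid for persistence rather than a cost saving.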

At present, most storage solutions only implement a "decentralized hard disk". That meets the most basic requirement, but more advanced needs such as computation over stored data are not yet fully met. These include rendering in a local development environment and inserting and extracting data streams, which are among the most commonly used and necessary functional modules in current Web2 applications. Ceramic's innovation, data stream storage, enables data rights management, version control, dynamic storage and composability. Stratos is trying to provide a more complete set of solutions with multiple modules, including database storage, static storage, computation and consensus. Arweave and Filecoin are also aware of the importance of computation and are building or encouraging related ecosystem modules; for example, Filecoin [has launched the FVM to support computation on Filecoin](https://filecoin.io/blog/posts/introducing-the-filecoin-virtual-machine/).

2. Data management

Building Web3 applications on decentralized storage makes them less susceptible to external interference and helps break monopoly and concentrated power. However, storage alone is not enough. Within the storage environment it also needs supporting technologies such as rendering and computation, data processing, permission configuration and privacy protection to secure the sovereignty of applications and of users' data, and so to realize the rise of individual sovereignty in the digital world. Access control and privacy protection in particular should be implemented with technical solutions that offer a high degree of sovereignty. In Web2 applications, data is stored on specific centralized servers according to its security protection level. Its security is guaranteed by network security, and its sovereignty is guaranteed by platforms (enterprise platforms, government platforms and so on). In this mode of data management, users are subordinate to the super administrator and have no rights over the data itself. Data security also depends on the centralized entity acting as super administrator: in a public security data leak in one region some time ago, a super administrator leaked their private key, exposing the private personal information of hundreds of millions of people.

Web3 data management should have the following two characteristics:

Data sovereignty guarantee. This should operate beyond the platform level, up to the world level, securing users' common rights in the digital world through world-scale consensus. In the traditional world the guarantee is platform-level and the rules do not come from consensus: a platform company controls all the rules, can change them at any time, and can therefore infringe on users' individual sovereignty at any time.
Data privacy guarantee. The privacy and security of user data is guaranteed mathematically through cryptography, not through the network security of a database. Selective encryption controlled by users is one of the basic rights of user data sovereignty.
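One way to see what "mathematically guaranteed" and "controlled by users" can mean is selective disclosure. The sketch below swaps in a simpler technique than encryption: salted hash commitments. The user publishes one commitment per field and later reveals only the fields they choose, together with the matching salt. All names and the sample profile are our own illustrative assumptions, and this is not the mechanism of any project named in this article.

```python
import hashlib
import hmac
import secrets


def commit(value: str, salt: bytes) -> str:
    """Salted SHA-256 commitment to a single field value."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def publish_commitments(profile: dict[str, str]) -> tuple[dict[str, str], dict[str, bytes]]:
    """Return (public commitments, private salts kept by the user)."""
    salts = {name: secrets.token_bytes(16) for name in profile}
    commitments = {name: commit(value, salts[name]) for name, value in profile.items()}
    return commitments, salts


def verify_disclosure(commitments: dict[str, str], name: str, value: str, salt: bytes) -> bool:
    """Check a revealed (value, salt) pair against the published commitment."""
    return hmac.compare_digest(commitments[name], commit(value, salt))


profile = {"nickname": "alice", "city": "Lisbon", "email": "alice@example.org"}
public, salts = publish_commitments(profile)

# The user chooses to reveal only the city to an application.
assert verify_disclosure(public, "city", "Lisbon", salts["city"])
assert not verify_disclosure(public, "city", "Porto", salts["city"])
```

The undisclosed fields stay hidden because the random salts make the commitments infeasible to guess, and the decision to reveal any field rests with whoever holds the salts, that is, the user.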

How Web3 data is managed depends on how the data is stored.

Project Name	Technical Foundation	Consensus Layer	Management Mode	Privacy Protection
IPFS	IPFS	None	Content-centric	None
Filecoin	IPFS	Yes	Content-centric	Third-party project support
Ceramic	IPFS	None	User-centric	-
Arweave	Blockchain	Yes	Perpetual, trackable uploader	Third-party project support
Stratos	Blockchain	Yes	Balance of the two, dynamic adjustment	Third-party project support

IPFS and Filecoin are content-centric: stored content is accessed through its content ID (CID), and third-party applications for data management are built on top. For example, with [ChainSafe Files](https://filecoin.io/zh-cn/blog/posts/chainsafe-files/), once the single sign-on problem is solved locally, data can easily be encrypted with asymmetric encryption before it is stored. The content-centric model makes user management difficult and complicates the assignment of data ownership. Beyond storage, Filecoin's ecosystem is expected to be much more extensible than other base layers. Especially after the launch of the FVM, specialized tools for data storage and data retrieval in particular vertical fields may appear, helping users and enterprises manage their data better and keep it secure, and enabling many new applications.

Ceramic is also based on IPFS but is user-centric. It is built on the IDX protocol, and the 3ID DID method (CIP-79) provides a Ceramic-native account system that is used for authentication on Ceramic. Users can control a 3ID DID with their blockchain wallet to execute transactions on data streams and manage their own data. This is achieved by associating the DID with the data and storing the data according to a data model. The data model defines the schema of the user data and is shared by all applications that use the same model.
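The user-centric model can be contrasted with content addressing through a toy "stream": a document with a stable identifier, a single controlling DID and an append-only list of commits. This is only a conceptual sketch under our own assumptions. It is not Ceramic's API, the DID strings are made up, and a plain string comparison stands in for real signature verification.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class Stream:
    """Toy mutable document: stable ID, single controller, append-only commits."""
    controller: str                                    # DID string of the owner
    commits: list[dict] = field(default_factory=list)
    stream_id: str = ""

    def __post_init__(self) -> None:
        genesis = json.dumps({"controller": self.controller}, sort_keys=True)
        # The ID is fixed at creation and does not change on update.
        self.stream_id = hashlib.sha256(genesis.encode()).hexdigest()[:16]

    def update(self, signer: str, content: dict) -> None:
        # Stand-in for verifying a signature made with the controller's key.
        if signer != self.controller:
            raise PermissionError("only the controlling DID may update this stream")
        self.commits.append(content)

    @property
    def state(self) -> dict:
        """Latest committed content, or an empty document before any commit."""
        return self.commits[-1] if self.commits else {}


profile = Stream(controller="did:3:alice-example")  # made-up DID
stream_id_at_creation = profile.stream_id

profile.update("did:3:alice-example", {"name": "Alice"})
profile.update("did:3:alice-example", {"name": "Alice", "bio": "builder"})

assert profile.stream_id == stream_id_at_creation  # address survives updates
print(profile.state)
```

Keeping the full commit list gives version history for free, while the controller check is where ownership of the data is enforced.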

Arweave is an on-chain decentralized storage project based on a one-time payment for permanent storage. Data is stored openly and transparently on the chain, anyone can access it, and it can be browsed through the Arweave block explorer. Managing data in this mode is exactly the same as managing on-chain data: there is no access control and no "hot update" of the original data, and every time the data is updated its index address changes. There is no problem with IPFS and Filecoin. However, the advantage is that it is always clear which user the data belongs to, which helps in tracing data rights and interests.

Stratos is also storage built on blockchain consensus. It maintains a dedicated index tree that records where data is stored, so that data updates can be tracked. Unlike Arweave, each Stratos storage node (Resource Node) is designed to provide computation, storage and content access control at the same time. The project team will build a database on the blockchain itself to handle dynamic data throughput. In form and management mode it is close to a decentralized cloud computer.

Trend: Decentralized Data Market

Where users own their data, a data market is an inevitable trend, with data circulating in it as a form of capital. There has been one attempt at a data market on Filecoin: Filehive, built and maintained by the decentralized application development studio [OB1](https://ob1.io/index.html), is an open source marketplace that supports uploading, maintaining, purchasing and/or transferring datasets. The project's GitHub repository stopped being updated two years ago, and the project has most likely failed.

Ceramic's Data Model Market

Ceramic mentions in its Dataverse vision the open data model market it wants to build: data needs interoperability, and interoperability can greatly boost productivity. Such a market works through community consensus on data models, similar to the ERC contract standards on Ethereum. Developers can pick a model as a template and obtain an application that is compatible with all data conforming to that model. For now, such a market is not a trading market.

A simple example of a data model: in a decentralized social network, the data model can be reduced to four components, namely:

PostList: stores an index of the user's posts
Post: stores a single post
Profile: stores the user's profile data
FollowList: stores the user's follow list
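The four components above can be sketched as plain data classes. The field names, the `UserData` container and the `timeline_ids` helper are our own illustrative assumptions and are not taken from any published Ceramic data model.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    post_id: str
    author: str   # DID of the author
    text: str


@dataclass
class PostList:
    post_ids: list[str] = field(default_factory=list)   # index of the user's posts


@dataclass
class Profile:
    did: str
    name: str


@dataclass
class FollowList:
    following: list[str] = field(default_factory=list)  # DIDs the user follows


@dataclass
class UserData:
    """Everything one user owns under the shared model."""
    profile: Profile
    posts: dict[str, Post] = field(default_factory=dict)
    post_list: PostList = field(default_factory=PostList)
    follow_list: FollowList = field(default_factory=FollowList)

    def publish(self, post: Post) -> None:
        self.posts[post.post_id] = post
        self.post_list.post_ids.append(post.post_id)


def timeline_ids(user: UserData) -> list[str]:
    """Any application that understands the model can read the same index."""
    return list(reversed(user.post_list.post_ids))


alice = UserData(Profile(did="did:example:alice", name="Alice"))
alice.publish(Post("p1", alice.profile.did, "hello"))
alice.publish(Post("p2", alice.profile.did, "second post"))
alice.follow_list.following.append("did:example:bob")

print(timeline_ids(alice))
```

Because `timeline_ids` depends only on the shared schema, a second application could reuse it on the same user data without any export or migration, which is the interoperability the data model is meant to provide.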

So how can data models be created, shared and reused on Ceramic for data interoperability across applications?

Ceramic provides a data model registry (DataModels Registry): an open source, community-built repository of reusable application data models for Ceramic. Here developers can publicly register, discover and reuse existing data models, which is the basis for interoperable applications built on shared data models. Currently it is hosted on GitHub; in the future it will be decentralized on Ceramic.

All data models added to the registry are automatically published as npm packages under the @datamodels scope. Any developer can install one or more data models with @datamodels/model-name, making them available to store or retrieve data at runtime with any IDX client, including DID DataStore or Self.ID.

In addition, Ceramic has set up a DataModels Forum on GitHub. Each model in the registry has its own discussion thread there, which the community can use to comment and discuss. It is also where developers post ideas for data models to solicit community input before adding them to the registry. Everything is still at an early stage and the registry contains few data models. A data model in the registry should become a CIP standard after evaluation by the community, much like Ethereum's smart contract standards, which makes data composability possible.

Ocean's data trading market

Ocean Protocol has established a decentralized data service supply chain network with the data trading market as the core. The diagram below shows the main services required to create a data service supply chain, providing data, algorithms, computation, storage, analysis, and curation. These components are tied to service enforcement agreements (such as service-level agreements), secure computing, access control, and licensing.

Source: Ocean Protocol

The main roles are data consumers, service providers, marketplaces, service publishers, validators and curators. Ocean provides a full set of data science tools. Data consumers can build data service pipelines on Ocean that automatically run algorithms for data processing and value discovery. In this process, consumers cannot download or view the datasets in full, which protects the datasets from being stolen; what users purchase is the right to use a dataset, not ownership of it.
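The idea of buying usage rights rather than the dataset itself can be sketched as follows: the data stays with the publisher, and a consumer with usage rights may only run approved algorithms and receive their results. This is a conceptual sketch under our own assumptions. The `ProtectedDataset` class and its methods are hypothetical and do not represent Ocean Protocol's contracts or libraries.

```python
from statistics import mean
from typing import Callable, Sequence


class ProtectedDataset:
    """Holds rows privately and only runs allow-listed algorithms on them."""

    def __init__(
        self,
        rows: Sequence[float],
        allowed: dict[str, Callable[[Sequence[float]], float]],
    ) -> None:
        self._rows = list(rows)
        self._allowed = dict(allowed)
        self._usage_rights: set[str] = set()

    def grant_usage(self, consumer: str) -> None:
        # Stands in for purchasing an access token; ownership never transfers.
        self._usage_rights.add(consumer)

    def run(self, consumer: str, algorithm: str) -> float:
        if consumer not in self._usage_rights:
            raise PermissionError("consumer has not acquired usage rights")
        if algorithm not in self._allowed:
            raise ValueError("algorithm is not approved by the publisher")
        # Only the aggregate result leaves; the rows themselves are never returned.
        return self._allowed[algorithm](self._rows)


dataset = ProtectedDataset([12.0, 15.0, 9.0], allowed={"mean": mean, "max": max})
dataset.grant_usage("consumer-1")

print(dataset.run("consumer-1", "mean"))
```

The allow-list matters as much as the access check: an unrestricted algorithm could simply return the rows, so the publisher has to approve what is allowed to run.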

Source: Ocean Protocol

In addition, Ocean is partnering with other institutions to build data marketplaces. One example is its partnership with [Acentrik](https://acentrik.io/), Mercedes-Benz's decentralized data marketplace, on its recently launched enterprise release. Acentrik Marketplace is powered by OceanONDA V4 smart contracts and libraries, which it uses to publish data services, deploy and mint data tokens and Acentrik asset management tokens, and consume data services.

3. Data usage and stack

Based on the above understanding, we propose the Web3 data stack, as shown in the figure below.

The bottom layer is where data sources are stored, including decentralized storage, on-chain and off-chain data, etc.
The second is the management applications for these data, including databases, data tables, index middleware, and data markets.
Under a certain data management paradigm, data mining can be performed, including algorithm modeling, statistical analysis and data visualization.

Web3 Data Stack Source: Zonff Partners

At present, most Web3 data usage in the industry concerns on-chain data. Data analysis and indexing tools keep emerging, and the gold mine of on-chain data has been thoroughly tapped; most projects mine on-chain data and only a few involve off-chain data. In general, data usage is an ETLA (Extract, Transform, Load, Analysis) process, and each stage has representative projects: The Graph for Extract; Dune and Luabase for Transform (into usable data tables) and Load; and Nansen and NFTGo for Analysis.
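The four ETLA stages can be illustrated end to end with a tiny in-memory pipeline. This is a minimal sketch under our own assumptions: the raw events are made up, a 6-decimal token is assumed, and an in-memory SQLite table stands in for the data tables that real platforms provide. None of this is the API of the projects named above.

```python
import sqlite3

# Extract: hypothetical raw transfer events, shaped like what an indexer might return.
raw_events = [
    {"from": "0xaaa", "to": "0xbbb", "value": "1500000", "block": 100},
    {"from": "0xbbb", "to": "0xccc", "value": "500000", "block": 101},
    {"from": "0xaaa", "to": "0xccc", "value": "250000", "block": 102},
]


def transform(event: dict) -> tuple[str, str, float, int]:
    """Normalize one raw event into a typed table row (6-decimal token assumed)."""
    return event["from"], event["to"], int(event["value"]) / 10**6, event["block"]


def load(rows: list[tuple]) -> sqlite3.Connection:
    """Load the rows into a queryable table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transfers (sender TEXT, recipient TEXT, amount REAL, block INTEGER)"
    )
    conn.executemany("INSERT INTO transfers VALUES (?, ?, ?, ?)", rows)
    return conn


def analyze(conn: sqlite3.Connection) -> list[tuple[str, float]]:
    """Total amount sent per address, largest first."""
    return conn.execute(
        "SELECT sender, SUM(amount) FROM transfers GROUP BY sender ORDER BY SUM(amount) DESC"
    ).fetchall()


conn = load([transform(e) for e in raw_events])
print(analyze(conn))
```

Each function corresponds to one stage, so any stage can be replaced independently, for example by swapping the hard-coded events for the output of an indexer.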

In decentralized storage, by contrast, the ETLA tooling is almost a desert, with only a few extraction projects, so there are huge opportunities and challenges here. **The Graph and the Ceramic community are themselves working on extracting data from Ceramic, and the founder of Orbis has tried building Cerscan, an explorer for data on Ceramic. Data stored on Arweave can already be read and managed with subgraphs through The Graph, and related third-party projects are doing the same on Filecoin. The TLA stages, however, are still uncharted. The main reason is that data stored in different decentralized storage networks is highly heterogeneous, and it is hard for a single model to mine its value. The most promising candidate for this step is Ceramic, because its data models greatly reduce the heterogeneity of data on Ceramic and so make the data more usable.**

Beyond on-chain data, many projects are trying to connect on-chain and off-chain data. Such projects can be regarded as "chain reform" projects. They fall into the following types:

Web2 data sovereignty and trading market: Itheum, Navigate, Swash, Phyllo and others. Projects of this type mainly combine traditional Internet data with on-chain data, hoping to open up the exchange of information between Web2 and Web3. Common practices are exporting Web2 data and importing it into a designated data pool, or directly binding traditional Internet social accounts.
Enterprise data consensus: Authtrail. This project integrates with an enterprise's internal database and adds a consensus layer to make data within the enterprise tamper-proof and traceable.
On-chain and off-chain data combination: Space and Time. Like Authtrail, this project integrates off-chain databases, but it has no consensus layer and leans more toward on-chain data. Pool is also doing similar things.

The usage paradigm of Web3 data is significantly different from that of Web2, which mainly lies in the way data is aggregated, that is, different types of data are stored, indexed, extracted, integrated and utilized in different ways. According to the previous classification, here are some brief summaries:

Public data: includes public data and some legal person data as classified in the "Guidelines for Cybersecurity Standards Practice - Guidelines for Data Classification and Grading". As a public good, it is data whose value can be mined openly. Access is permissionless, but user ownership can be traced, for example to attribute airdrop rewards. Typical examples are on-chain data and unencrypted application data stored in decentralized storage (such as user posts, likes and comments). The most important upstream support for its use comes from indexing applications such as The Graph and Web3-native databases such as Tableland.
Private data: includes personal information and some legal person data as classified in the "Guidelines for Cybersecurity Standards Practice - Guidelines for Data Classification and Grading". It requires encrypted storage and some configuration of privacy permissions; access is permissioned and the data cannot be obtained publicly. If it is stored in decentralized storage or on a blockchain, it needs encrypted storage with configurable permissions, or other privacy technologies such as ZK, MPC and TEE. The most important upstream support for its use comes from database applications such as Kwil and Ceramic.